FINAL REPORT:

Computer-Based  Measurement  of  Intellectual  Capabilities 


Objectives 


The objectives of this research program were based on a review of previous
research literature that identified the potential of computerized adaptive
testing to reduce at least five kinds of errors in the measurement of human
capacities:

1.  Errors  due  to  mismatch  of  test  item  difficulty  with  testee  ability; 

2.  Errors  due  to  the  psychological  effects  of  testing; 

3.  Errors  due  to  inappropriate  dimensionality; 

4.  Errors  due  to  failure  to  extract  sufficient  information  from  the  testee; 

5. Errors due to over-simplistic conceptualizations of intellectual
capabilities.

Within the context of these five sources of error, which act to reduce the
precision, accuracy and utility of current ability testing procedures, the
research was designed to:

1.  Extend  previous  research  efforts  to  identify  the  most  useful  computer-based 
adaptive  testing  strategies. 

2. Study the psychological effects of computerized adaptive testing, to
identify those testing conditions which minimize adverse effects and maximize
positive effects.

3.  Investigate  the  problem  of  intra-individual  multidimensionality  in  ability 
testing. 

4.  Examine  the  use  of  such  response  modes  as  probabilistic  responding  and 
free-response  methods  for  use  in  computerized  adaptive  testing  in  order  to 
extract  maximum  information  from  each  examinee's  response  to  each  test 
item. 


5. Develop, refine and evaluate new computer-administered ability tests which
measure abilities not now measurable using paper and pencil ability testing.

Research in pursuance of these primary objectives began in September 1975
and continued through December 1978. A contract extension, funded by the Navy
Personnel Research and Development Center, was designed to complete a
live-testing validity comparison of adaptive and conventional tests using Marine
recruits. This extension continued the contract through September 1979.
Technical reports were completed through January 1983.
Approach

The major focus of the research was on the evaluation of adaptive testing
strategies by comparison of their characteristics with each other and with
conventional tests. Both monte carlo simulation and live testing were used in
these studies. In Research Report 75-6 the stradaptive testing strategy was
examined in monte carlo simulation to evaluate various scoring techniques
possible with this testing strategy, under various test lengths and prior
information conditions. Performance of the stradaptive testing strategy was also
evaluated in live testing (Research Report 80-3) by comparing its validity with
that of a conventional test and a Bayesian adaptive test.

The Bayesian adaptive testing strategy was further studied in several
reports. Monte carlo simulation was used in Research Report 76-1 to examine the
performance of this testing strategy under several item pool configurations and
at a number of test lengths. In Research Reports 80-5 and 83-1, the reliability
and validity of the Bayesian adaptive test were compared with those of
conventional tests in a college population (80-5) and in a military recruit
population (83-1). Research Report 77-4 describes a procedure for improving the
efficiency of item selection in Bayesian adaptive testing.
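
The logic of a Bayesian adaptive test (update the ability distribution after
each response, then choose the next item) can be illustrated with a minimal
sketch. This is not the procedure used in the reports, which was Owen's Bayesian
sequential procedure; the sketch substitutes a discrete-grid posterior and an
exhaustive search for the item with the smallest expected posterior variance.
The three-parameter logistic response function, the grid, and all function names
are assumptions made for illustration.

```python
import math

# Discrete grid standing in for the ability continuum (an assumption of this sketch).
GRID = [-4.0 + 0.1 * i for i in range(81)]

def p_correct(theta, a, b, c):
    """Three-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def normal_prior(mean=0.0, sd=1.0):
    """Normal prior over ability, normalized on the grid."""
    weights = [math.exp(-0.5 * ((t - mean) / sd) ** 2) for t in GRID]
    total = sum(weights)
    return [w / total for w in weights]

def update(posterior, item, correct):
    """Bayes update of the grid posterior after one scored response."""
    a, b, c = item
    unnorm = []
    for t, w in zip(GRID, posterior):
        p = p_correct(t, a, b, c)
        unnorm.append(w * (p if correct else 1.0 - p))
    total = sum(unnorm)
    return [u / total for u in unnorm]

def posterior_mean_var(posterior):
    """Posterior mean (the ability estimate) and posterior variance."""
    mean = sum(t * w for t, w in zip(GRID, posterior))
    var = sum((t - mean) ** 2 * w for t, w in zip(GRID, posterior))
    return mean, var

def expected_posterior_variance(posterior, item):
    """Posterior variance averaged over the two possible responses."""
    a, b, c = item
    p_right = sum(w * p_correct(t, a, b, c) for t, w in zip(GRID, posterior))
    var_right = posterior_mean_var(update(posterior, item, True))[1]
    var_wrong = posterior_mean_var(update(posterior, item, False))[1]
    return p_right * var_right + (1.0 - p_right) * var_wrong

def select_item(posterior, pool, used):
    """Index of the unused item that minimizes expected posterior variance."""
    candidates = [i for i in range(len(pool)) if i not in used]
    return min(candidates,
               key=lambda i: expected_posterior_variance(posterior, pool[i]))
```

Because every unused item is evaluated at every step, a search of this kind grows
costly as the pool grows, which is the sort of computing burden that motivates a
more efficient item-selection procedure.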

Several other problems concerned with the application of adaptive tests to
the measurement of abilities were discussed in a symposium presented at the 1976
meeting of the Military Testing Association (Research Report 77-1). An overview
of adaptive testing strategies, presented by McBride, included a discussion of
item selection strategies, scoring adaptive tests, and problems of evaluating
adaptive tests. The problem of estimating trait status in adaptive testing
based on item response theory approaches was presented by Sympson, including a
comparison of the characteristics of Bayesian and likelihood-based estimates.
Vale, in his paper, considered the problem of classifying individuals into
discrete ability categories (e.g., pass-fail); his monte carlo analysis compared
adaptive and conventional tests designed for making dichotomous classifications.


The effects of testing conditions on test performance were investigated in
a number of live-testing studies. Since computer-administered testing permits
immediate scoring of an examinee's answer to a test question, it becomes
possible to inform the examinee immediately after each response is given as to
whether the answer was correct or incorrect. This immediate knowledge of
results, or immediate feedback, was investigated in several studies in terms of
its effects on ability test performance in adaptive and conventional tests
(Research Reports 76-3 and 78-2), its interaction with test difficulty (Research
Report 78-2) and computer versus self-paced test administration (Research Report
81-2), and its effects on examinees' reactions to test administration (Research
Reports 76-4 and 81-2). Related studies examined the effects of time limits on
test-taking behavior (Research Report 76-2) and the accuracy of the perceived
difficulty of test items (Research Report 77-3).

The question of intra-individual dimensionality in performance on ability
tests was recast within the more general framework of the fit of individuals to
item response theory (IRT) models. This issue was examined in one study
(Research Report 79-7) in which the predicted and actual performance of single
individuals was examined for indications of lack of person fit due to
intra-individual multidimensionality or other factors reflecting non-fit to the
unidimensional IRT models.
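
The comparison of predicted and actual performance for a single individual can
be sketched as follows. This is a simplified illustration rather than the method
of Research Report 79-7: it assumes the three-parameter logistic model, a known
ability value, and a fixed number of difficulty strata, and the function names
are hypothetical. Items are sorted by difficulty and grouped into strata; the
observed proportion correct in each stratum is paired with the proportion the
model predicts.

```python
import math

def p_correct(theta, a, b, c):
    """Three-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def person_response_curve(theta, items, responses, n_strata=3):
    """Observed and model-expected proportion correct per difficulty stratum.

    items     -- list of (a, b, c) parameter tuples
    responses -- list of 0/1 scores, in the same order as items
    Returns a list of (observed, expected) pairs, easiest stratum first.
    """
    # Order item indices from easiest to hardest.
    order = sorted(range(len(items)), key=lambda i: items[i][1])
    size = math.ceil(len(order) / n_strata)
    curve = []
    for start in range(0, len(order), size):
        idx = order[start:start + size]
        observed = sum(responses[i] for i in idx) / len(idx)
        expected = sum(p_correct(theta, *items[i]) for i in idx) / len(idx)
        curve.append((observed, expected))
    return curve
```

Under a unidimensional model the observed proportions should fall as difficulty
rises, roughly tracking the expected values; large or erratic departures are the
kind of pattern a person-fit index is meant to flag.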


The  use  of  test  item  response  modes  other  than  the  multiple-choice  item  was 
examined  in  one  study  (Research  Report  77-2)  which  compared  test  information 
derived  from  free-response  administration  to  that  of  the  same  items  administered 
in  multiple-choice  mode. 

The use of the unique capability of interactive computers to measure
abilities not measurable by paper-and-pencil tests was examined in one study
(Research Report 80-2). An interactive spatial reasoning test was designed based
on the popular "15 puzzle" in which examinees were required to restructure a set
of 15 numerals into a target pattern using a minimum number of moves. Examinee
performance on the test was analyzed in terms of such factors as number of moves
to solution, quality of the moves, and response latencies at each point in the
testing procedure.


Major  Findings 

The major findings below are generally organized according to the original
objectives of the research program. Additional details are in the Research
Report abstracts. Many of the original Research Reports contain additional
important findings.

Adaptive  Testing  Strategies 

1. Monte carlo data comparing the stradaptive test with non-adaptive
approaches to ability testing (Research Report 75-6) show that the stradaptive
test provides more equiprecise measurement than a peaked conventional test.
As item discriminations increased, the equiprecision of the stradaptive
test increased relative to that of the conventional test.

2.  A  stradaptive  test  with  an  average  of  25%  fewer  items  than  a  conventional 
test  obtained  significantly  higher  validities  with  a  college  grade-point 
average  criterion  than  did  the  conventional  test  (Research  Report  80-3). 

3.  Monte  carlo  evaluation  of  a  Bayesian  adaptive  testing  strategy  identified  a 
number  of  psychometric  problems  in  the  ability  estimates  resulting  from 
this  testing  strategy  (Research  Report  76-1).  Bayesian  ability  estimates 
were  highly  correlated  with  test  length,  were  non-linearly  biased  for  about 
two-thirds  of  the  ability  range,  and  were  dependent  on  the  prior  ability 
estimate. 

4. Although the monte carlo simulations of the Bayesian adaptive test
identified these potential problems with the Bayesian ability estimates, the
problems appeared to have little impact on the reliability and validity of
Bayesian ability estimates. Live-testing studies of the Bayesian adaptive
testing strategy in a college population showed validities equal to that of a
conventional test (Research Report 80-3), and high reliabilities for tests of 2
to 30 items in length (Research Report 80-5); in the latter study, however,
using a concurrent validity criterion, the conventional test had higher
validity correlations than the adaptive test. In a military recruit
population (Research Report 83-1), the Bayesian adaptive test achieved both
higher validities and higher reliabilities than did a comparable conventional
test. In this population, a 9-item adaptive test achieved the same
reliability as a 17-item conventional test; 10- to 11-item adaptive tests
achieved the same concurrent validities as 28- to 30-item conventional
tests.

5. The original form of the Bayesian adaptive test used an item-search
procedure that could require excessive amounts of computing time for an
interactive test administration environment. A rapid item-search procedure was
developed and shown to select the same subset of items as the original
procedure in about one-tenth the amount of computer time (Research Report
77-4).

6. Different methods of estimating ability from adaptive tests have different
characteristics. Validities in the prediction of college grade-point
averages from a stradaptive test were higher for ability estimates not based on
IRT methods than they were for IRT-based ability estimates (Research Report
80-3). Within the IRT methods for estimating ability, Bayesian methods are
slightly order dependent, resulting in slightly different ability estimates
with the same items administered in different orders (Sympson, in Research
Report 77-1). Bayesian ability estimates also have different psychometric
characteristics than do estimates based on maximum likelihood procedures.

7. Adaptive tests can be used for classification purposes as well as for
measurement on a continuous scale. When compared to conventional tests
designed to make classifications, adaptive tests can classify more accurately
than conventional tests when it is necessary to make more than a single
dichotomous classification based on test scores (Vale, in Research Report
77-1).

Test  Administration  Conditions 

8. An analysis of response latency data showed that testees approach different
testing procedures in different ways (Research Report 76-2). The response
latency data suggest that these different test-taking styles and strategies
might be potentially useful as moderator or predictor variables in the
prediction of external criteria.

9. Computer-administered feedback (immediate knowledge of results) on a
conventional test appears to result in enhanced ability test performance for
testees of all ability levels (Research Report 76-3). Under
computer-administered feedback conditions, mean test scores were significantly
higher for both high- and low-ability testees. Ninety percent of college
students favorably evaluated their experience with computer-administered
feedback (Research Report 76-4).

10. Adaptive tests appear to be more intrinsically motivating for low-ability
testees (Research Report 76-4), and result in higher ability estimates
(Research Report 76-3), than similarly administered conventional tests. This
suggests that adaptive testing might eliminate some of the undesirable
psychological effects characteristic of conventional testing procedures,
resulting in fairer and more accurate test scores for testees who typically
obtain low scores on conventional ability tests.


11. Item-difficulty perceptions of college students were highly related to
objective indices of test item difficulty (Research Report 77-3). This
suggests that test difficulty, which may differ between conventional and
adaptive tests for examinees of the same ability, might be an important factor
affecting the test performance of individuals.

12. Test difficulty interacted with immediate knowledge of results to produce
effects on ability estimates, but not on psychological reactions to the
testing conditions (Research Report 78-2). Since difficulty is more equal
across ability levels in an adaptive test than in a conventional test,
these results suggest that the testing environment of adaptive tests will
result in fewer sources of error in ability estimates than will conventional
ability tests.

Other  Findings 

13. Analysis of person-fit data derived from the person response curve
indicated that the vast majority of college students studied responded to a set
of test items in accordance with the 3-parameter logistic IRT model (Research
Report 79-7). The person response curve approach also identified a small
group of individuals whose responses to the test items appeared to result
from an underlying multidimensional ability structure with respect to the
ability domain studied.

14. The dependence of adaptive testing on the multiple-choice item will result
in test scores with less than optimal properties. Analysis of free-response
item data indicates that more informative ability estimates can be
derived from free-response items than from the same items administered as
multiple-choice items and scored by optimal IRT methods; differences were
greater for high-ability examinees (Research Report 77-2).

15. Interactive computer administration of ability test items permits the
design and implementation of ability tests using novel item formats, which
may extend the range of measurable abilities beyond those now measurable
using a dimensional approach. The design and implementation of an interactive
spatial problem-solving test (Research Report 80-2) permitted the
measurement and analysis of a number of problem-solving variables
that described individual differences in problem-solving styles; these
variables might be useful as ability variables, following further
study and refinement.

Implications  for  Further  Research 

The findings and experience of this research program support the
feasibility, utility and psychometric advantages of computerized adaptive
measurement of intellectual capabilities. However, many new questions were
raised by the research, and some of the original questions addressed are still
in need of further research.


Research has concentrated on comparison of the stradaptive and Bayesian
adaptive testing strategies with conventional tests. Further research is needed
(1) comparing these strategies directly with each other, both in live testing
and in simulation, and (2) comparing these strategies with other adaptive
testing strategies, such as an information-based item selection routine.

All adaptive testing strategy comparisons to date that used monte carlo
simulation techniques have made two assumptions that are not characteristic of
real data. First, they have assumed that the item pool is characterized by
items with known parameter values. In real item pools, however, item parameter
values are never known, but are always estimated. These estimates are only
approximations to the true values and, as a consequence, contain some degree of
error, with rather substantial degrees of error for some of the item parameters.
Since adaptive testing strategies are designed to explicitly select items based
on these item parameter estimates, the possibility exists that in a real item
pool with error-laden item parameters adaptive tests might perform less
optimally due to the error in the item parameter estimates. Thus, simulation
studies should be designed and implemented to experimentally vary the degree of
error in item parameter estimates and to evaluate the effects of these errors on
the performance of adaptive testing strategies.
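
A simulation of the kind proposed here might be organized along the following
lines. The sketch is hypothetical throughout: it assumes a three-parameter
logistic model without a scaling constant, Gaussian error added to each true
parameter, maximum-information item selection, and grid-based posterior-mean
scoring, none of which is prescribed by the research program. The essential
point is that items are selected and scored with the error-laden estimates while
responses are generated from the true values.

```python
import math
import random

GRID = [-4.0 + 0.1 * i for i in range(81)]  # ability grid (assumption)

def p_correct(theta, a, b, c):
    """Three-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def information(theta, a, b, c):
    """Item information at theta under the three-parameter logistic model."""
    p = p_correct(theta, a, b, c)
    return a * a * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2

def perturb(pool, sd, rng):
    """Add Gaussian error to each parameter, keeping values in a plausible range."""
    noisy = []
    for a, b, c in pool:
        noisy.append((max(0.2, a + rng.gauss(0.0, sd)),
                      b + rng.gauss(0.0, sd),
                      min(0.5, max(0.0, c + rng.gauss(0.0, sd / 4.0)))))
    return noisy

def posterior_mean(responses, items):
    """Posterior-mean ability estimate on the grid with a standard normal prior."""
    post = [math.exp(-0.5 * t * t) for t in GRID]
    for (a, b, c), u in zip(items, responses):
        post = [w * (p_correct(t, a, b, c) if u else 1.0 - p_correct(t, a, b, c))
                for w, t in zip(post, GRID)]
    return sum(t * w for t, w in zip(GRID, post)) / sum(post)

def simulate_rmse(true_pool, est_pool, n_simulees, test_length, rng):
    """Root mean squared error of final ability estimates across simulees."""
    sq_err = 0.0
    for _ in range(n_simulees):
        theta = rng.gauss(0.0, 1.0)
        used, items, responses, estimate = set(), [], [], 0.0
        for _ in range(test_length):
            # Select with the ESTIMATED parameters ...
            j = max((i for i in range(len(est_pool)) if i not in used),
                    key=lambda i: information(estimate, *est_pool[i]))
            used.add(j)
            # ... but generate the response from the TRUE parameters.
            responses.append(rng.random() < p_correct(theta, *true_pool[j]))
            items.append(est_pool[j])
            estimate = posterior_mean(responses, items)
        sq_err += (estimate - theta) ** 2
    return math.sqrt(sq_err / n_simulees)
```

Running simulate_rmse over a range of sd values passed to perturb, with sd = 0
as the error-free baseline, would give one way of tracing how measurement
accuracy changes as parameter error increases.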

A second assumption made in all monte carlo comparisons of adaptive testing
strategies is that the item pool is strictly unidimensional, since only one set
of item parameter values is used for each item. In real data, however, item
pools are very rarely strictly unidimensional. Frequently, item pools are
characterized by second and succeeding factors that account for anywhere from
trivial to substantial portions of the item pool variance. While
multidimensional IRT models have not yet been sufficiently operationalized to
permit the estimation of item parameters for dimensions beyond the first, it is
possible to examine the effects of multidimensionality on adaptive testing
strategies. One approach to studying this problem is to simulate the
administration of adaptive testing strategies with unidimensional item
parameters when item responses are generated from an underlying multidimensional
structure. This approach assumes that the dimensionality of the item responses
is the true underlying multidimensional structure, while the apparent
unidimensionality of the item pool is the result of the item parameterization
process applied to it. Studies of this type would enable the identification of
the degrees and types of multidimensionality that could be tolerated by the
various adaptive testing strategies without serious degradation of their
performance.
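
The response-generation side of such a study can be sketched briefly. The
two-dimensional compensatory logistic model below is an assumption chosen for
illustration, as are the function names; the text does not specify a generating
model. Each item carries two discriminations, an intercept, and a guessing
parameter, while the testing strategy would see only a unidimensional
parameterization of the same item.

```python
import math
import random

def p_two_dim(theta1, theta2, a1, a2, d, c):
    """Compensatory two-dimensional logistic probability of a correct response."""
    z = a1 * theta1 + a2 * theta2 + d
    return c + (1.0 - c) / (1.0 + math.exp(-z))

def generate_responses(theta1, theta2, items, rng):
    """Simulate scored responses from the two-dimensional structure."""
    return [rng.random() < p_two_dim(theta1, theta2, *item) for item in items]

def unidimensional_view(item):
    """Parameters (a, b, c) obtained by ignoring the second dimension.

    A real study would instead fit unidimensional parameters to generated
    data; dropping a2 is a simplification made for this sketch.
    """
    a1, _a2, d, c = item
    return (a1, -d / a1, c)
```

When a2 is zero for every item the generating model reduces to the
unidimensional three-parameter logistic model, which provides a baseline;
increasing a2 on some or all items varies the degree and type of
multidimensionality to which the adaptive strategy is exposed.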

Further live-testing comparisons of adaptive testing strategies are also
necessary. The four live-testing studies completed under this contract yielded
somewhat conflicting results. In two of the four studies, adaptive tests
obtained higher validities than conventional tests with a smaller average number
of items, and in one study with a smaller median number of items. In the study
using military recruits a very clear advantage was obvious for the adaptive
tests beginning at short test lengths. When a large group of college students
was studied, however, although the expected differences in reliability were
obtained, the conventional test performed better on the concurrent validity
criterion. Since the design of the two large-sample studies was similar,
differences in results could be attributable to differences in the examinees,
the item pools, or the criterion tests. Additional live-testing studies are
needed to evaluate the effects of these conditions, as well as to evaluate the
performance of other adaptive testing strategies and to evaluate their
performance with additional criterion variables.

Test  Administration  Conditions 

The research results show that a number of test administration variables
influence test scores, IRT-based ability estimates, and/or examinees' reactions
to tests. These include test speededness, test difficulty, and immediate
feedback to examinees as to whether their item responses are correct or
incorrect. Testing strategy (adaptive versus conventional) also had some effects
on test performance and reactions, probably due to the differing difficulties of
adaptive and conventional tests. Immediate feedback of results appeared to be an
important potential factor in increasing test-taking motivation and improving
test scores.

Studies completed on the effects of test administration conditions have all
utilized volunteer college students as examinees and have used verbal ability
items in the tests administered. Since the test-taking motivation of volunteer
students might differ when tested under conditions where the tests are being
used for grading or other purposes, future studies should examine the effects of
test administration conditions when the tests being administered are to be used
for purposes other than research. In addition, the generality of the observed
effects should be studied on populations other than college students, and using
other tests in addition to verbal ability tests. Further studies should also
include the effects of other adaptive testing strategies as test administration
conditions, in conjunction with immediate knowledge of results.

Intra-Individual  Dimensionality,  Response  Modes,  and  New  Abilities 

Research in these three areas was only begun during the contract period.
The person characteristic curve results show that the vast majority of the one
group of college students studied responded to a set of test items in accordance
with the three-parameter logistic IRT model. A small group of students was
identified, however, whose responses appeared to be reliably divergent from that
model. These deviations were ascribed to intra-individual multidimensionality.
Since the person response curve method was used in only this one study, further
studies are indicated. Of importance is the performance in monte carlo
simulations of the person-fit indices under conditions of unidimensionality, the
derivation of appropriate sampling distributions of the person-fit indices, the
evaluation of alternate person-fit indices, and the effect of test structure
characteristics (e.g., distributions of item characteristics) on the performance
of person-fit indices. Additional live-testing studies should also be
implemented to study the effects of various test administration conditions
(e.g., interruptions, poor testing conditions, immediate knowledge of results)
on intra-individual dimensionality by means of the person response curve and
associated indices of person fit.

Failure to extract sufficient information from an examinee's responses to
multiple-choice test items can lower the quality of obtained measurements. The
one study completed on this problem indicated that the use of free-response
items was able to improve the measurement precision of a set of vocabulary items
beyond that possible from scoring the same items as polychotomous
multiple-choice items. Both of these administration/scoring modes provide better
measurement than dichotomously-scored multiple-choice items. Since this study
used college students on a single short vocabulary test, further studies are
obviously needed to examine the generality of the results. In addition, research
is needed to examine the performance of other alternatives to the
dichotomously-scored multiple-choice item such as probabilistic responding,
which are now feasible when administered by interactive computers.

Interactive computer administration of ability tests makes possible the
development of a wide range of new kinds of ability tests to supplement the
standard dimensionality-based tests currently in use. This project has
demonstrated that interactive administration of a problem-solving type of test
can result in substantial amounts of new kinds of data on examinees in addition
to the traditional number of items answered correctly. These data can include
information on problem-solving styles and response latencies that might be
indicative of other individual differences problem-solving variables. Future
research should investigate the psychometric characteristics of these variables,
including their reliabilities and their contributions to validity, as well as
examine the utility of the interactive computer for measuring other abilities
such as spatial, perceptual, and memory abilities which are now possible to
measure by computer administration.
Research Report 75-6

A Simulation Study of Stradaptive Ability Testing
C.  David  Vale  and  David  J.  Weiss 
December  1975 

A conventional test and two forms of a stradaptive test were administered to
thousands of simulated subjects by minicomputer. Characteristics of the three
tests using several scoring techniques were investigated while varying the
discriminating power of the items, the lengths of the tests, and the availability
of prior information about the testee's ability level. The tests were evaluated
in terms of their correlations with underlying ability, the amount of
information they provided about ability, and the equiprecision of measurement
they exhibited. Major findings were (1) scores on the conventional test
correlated progressively less with ability as item discriminating power was
increased beyond a = 1.0; (2) the conventional test provided increasingly poorer
equiprecision of measurement as items became more discriminating; (3) these
undesirable characteristics were not characteristic of scores on the stradaptive
test; (4) the stradaptive test provided higher score-ability correlations than
the conventional test when item discriminations were high; (5) the stradaptive
test provided more information and better equiprecision of measurement than the
conventional test when test lengths and item discriminations were the same for
the two strategies; (6) the use of valid prior ability estimates by stradaptive
strategies resulted in scores which had better measurement characteristics than
scores derived from a fixed entry point; (7) a Bayesian scoring technique
implemented within the stradaptive testing strategy provided scores with good
measurement characteristics; and (8) further research is necessary to develop
improved flexible termination criteria for the stradaptive test. (AD A020961)


Research  Report  76-1 

Some  Properties  of  a  Bayesian  Adaptive  Ability  Testing  Strategy 
James  R.  McBride  and  David  J.  Weiss 
March  1976 

Four monte carlo simulation studies of Owen's Bayesian sequential procedure for
adaptive mental testing were conducted. Whereas previous simulation studies of
this procedure have concentrated on evaluating it in terms of the correlation of
its test scores with simulated ability in a normal population, these four
studies explored a number of additional properties, both in a normally
distributed population and in a distribution-free context. Study 1 replicated
previous studies with finite item pools, but examined such properties as the
bias of estimate, mean absolute error, and correlation of test length with
ability. Studies 2 and 3 examined the same variables in a number of hypothetical
infinite item pools, investigating the effects of item discriminating power,
guessing, and variable vs. fixed test length. Study 4 investigated some
properties of the Bayesian test scores as latent trait estimators, under three
different item pool configurations (regressions of item discrimination on item
difficulty). The properties of interest included the regression of latent trait
estimates on actual trait levels, the conditional bias of such estimates, the
information curve of the trait estimates, and the relationship of test length to
ability level. The results of these studies indicated that the ability estimates
derived from the Bayesian test strategy were highly correlated with ability
level. However, the ability estimates were also highly correlated with number of
items administered, were non-linearly biased, and provided measurements which
were not of equal precision at all levels of ability. (AD A022964)


Research  Report  76-2 

Effects  of  Time  Limits  on  Test-Taking  Behavior 
T.  W.  Miller  and  David  J.  Weiss 
April  1976 

Three  related  experimental  studies  analyzed  rate  and  accuracy  of  test  response 
under  time-limit  and  no-time-limit  conditions.  Test  instructions  and  multiple- 
choice  vocabulary  items  were  administered  by  computer.  Student  volunteers  re¬ 
ceived  monetary  rewards  under  both  testing  conditions.  In  the  first  study,  col¬ 
lege  students  were  blocked  into  high-  and  low-ability  groups  on  the  basis  of 
pretest  scores.  Results  for  both  ability  groups  showed  higher  response  rates 
under  time-limit  conditions  than  under  no-time-limit  conditions.  There  were  no 
significant  differences  between  the  time-limit  and  no-time-limit  accuracy 
scores.  Similar  results  were  obtained  in  a  second  study  in  which  each  student 
received  both  time-limit  and  no-time-limit  conditions.  In  a  third  study  each 
testee  received  the  same  testing  condition  twice,  and  higher  response  rates  were 
observed  under  the  time-limit  condition;  response  accuracy  remained  consistent 
across  testing  conditions.  All  three  studies  showed  essentially  zero  correla¬ 
tions  between  response  rate  and  response  accuracy.  Response  latency  data  were 
also  analyzed  in  the  three  studies.  These  data  suggested  the  existence  of  different
test-taking  styles  and  strategies  under  time-limit  and  no-time-limit
testing  conditions.  The  results  of  these  studies  suggest  that  number-correct 
scores  from  time-limit  tests  are  a  complex  function  of  response  rate,  response 
accuracy,  test-taking  style  and  test-taking  strategy,  and  therefore  are  not 
likely  to  be  as  valid  or  as  useful  as  number-correct  scores  from  no-time-limit
tests.  (AD  A024422)


Research  Report  76-3 

Effects  of  Immediate  Knowledge  of  Results 
and  Adaptive  Testing  on  Ability  Test  Performance 
Nancy  E.  Betz  and  David  J.  Weiss 
June  1976 

This  study  investigated  the  effects  of  immediate  knowledge  of  results  (KR)  concerning
the  correctness  or  incorrectness  of  each  item  response  on  a  computer-administered
test  of  verbal  ability.  The  effects  of  KR  were  examined  on  a  50-item
conventional  test  and  a  stradaptive  ability  test  and  in  high-  and  low-ability 
groups.  The  primary  dependent  variable  was  maximum  likelihood  ability  estimates 
derived  from  the  item  responses.  Results  indicated  that  mean  test  scores  for 
the  High-Ability  group  receiving  KR  were  higher  than  for  the  No-KR  group  on  both 
the  conventional  and  stradaptive  tests.  For  Low-Ability  examinees,  mean  scores 
were  higher  under  KR  conditions  than  under  No-KR  conditions  on  both  tests,  but 
the  difference  was  statistically  significant  only  for  the  conventional  test.
However,  the  higher  mean  scores  of  the  Low-Ability  testees  on  the  stradaptive
test  indicated  that  for  low-ability  examinees,  adaptive  testing  had  the  same
effects  on  test  performance  as  did  the  provision  of  immediate  KR.  Knowledge  of
results  did  not  have  significant  effects  on  response  latencies,  response
consistency  on  the  stradaptive  test,  or  the  internal  consistency  reliability  of
the  conventional  test.  No  significant  score  differences  were  found  on  a  44-item
post-test  administered  without  KR,  indicating  that  the  facilitative  effects  of 
knowledge  of  results  on  test  performance  were  confined  to  the  test  in  which  KR 
was  provided.  The  results  of  the  study  were  interpreted  as  indicating  the  po¬ 
tential  of  both  immediate  knowledge  of  results  and  adaptive  testing  procedures 
to  increase  the  extent  to  which  ability  tests  measure  "maximum  performance"  lev¬ 
els.  (AD  A027147) 


Research  Report  76-4 

Psychological  Effects  of  Immediate  Knowledge  of 
Results  and  Adaptive  Ability  Testing 
Nancy  E.  Betz  and  David  J.  Weiss
June  1976 

This  study  investigated  the  effects  of  providing  immediate  knowledge  of  results 
(KR)  and  adaptive  testing  on  test  anxiety  and  test-taking  motivation.  Also 
studied  was  the  accuracy  of  student  perceptions  of  the  difficulty  of  adaptive 
and  conventional  tests  administered  with  or  without  immediate  knowledge  of  re¬ 
sults.  Testees  were  350  college  students  divided  into  high-  and  low-ability 
groups  and  randomly  assigned  to  one  of  four  test  strategies  by  KR  conditions. 

The  ability  level  of  examinees  was  found  to  be  related  to  their  reported  levels 
of  motivation  and  to  differences  in  reported  motivation  under  the  different 
testing  conditions.  Low-ability  examinees  reported  significantly  higher  levels 
of  motivation  on  the  stradaptive  test  than  on  the  conventional  test,  while  the
reported  motivation  of  high-ability  examinees  did  not  differ  as  a  function  of 
ability  level.  Low-ability  testees  reported  lower  motivation  with  KR  than  with¬ 
out  KR,  while  higher  ability  testees  reported  higher  motivation  with  KR.  Analy¬ 
sis  of  the  anxiety  data  indicated  that  students  reported  significantly  higher 
levels  of  anxiety  on  the  stradaptive  test  than  on  the  conventional  test.  The
provision  of  KR  did  not  result  in  significant  differences  in  reported  anxiety. 
However,  highest  levels  of  anxiety  were  reported  by  the  low-ability  group  on  the 
stradaptive  test  administered  with  KR.  These  results,  in  conjunction  with  previously
reported  data  on  effects  of  KR  on  ability  test  performance,  were  interpreted
as  being  the  result  of  facilitative  anxiety.  Students  were  able  to  perceive
the  relative  difficulty  of  test  items  with  some  accuracy.  However,  perceptions
of  the  relative  degree  of  test  difficulty  were  much  more  closely  related
to  actual  test  score  on  the  conventional  test  than  on  the  stradaptive  test.
Over  90%  of  the  students  reacted  favorably  to  the  provision  of  immediate  KR.
These  results  suggest  that  adaptive  testing  creates  a  psychological  environment 
for  testing  which  is  more  equivalently  motivating  for  examinees  of  all  ability 
levels  and  results  in  a  greater  standardization  of  the  test-taking  environment, 
than  does  conventional  testing.  (AD  A027170) 

Research  Report  77-1

Applications  of  Computerized  Adaptive  Testing 
James  R.  McBride,  James  B.  Sympson,
C.  David  Vale,  Steven  M.  Pine,  and  Isaac  I.  Bejar
Edited  by  David  J.  Weiss 
March  1977 

This  symposium  consisted  of  five  papers: 

1.  James  R.  McBride:  A  Brief  Overview  of  Adaptive  Testing 

Adaptive  testing  is  defined,  and  some  of  its  item  selection  and  scoring 
strategies  briefly  discussed.  Item  response  theory,  or  item  characteristic 
curve  theory,  which  is  useful  for  the  implementation  of  adaptive  testing,  is
briefly  described.  The  concept  of  "information"  in  a  test  is  introduced 
and  discussed  in  the  context  of  both  adaptive  and  conventional  tests.  The 
advantages  of  adaptive  testing,  in  terms  of  the  nature  of  information  it 
provides,  are  described. 

2.  James  B.  Sympson:  Estimation  of  Latent  Trait  Status  in  Adaptive  Testing 
Procedures 

The  role  of  latent  trait  theory  in  measurement  for  criterion  prediction  and 
in  criterion-referenced  measurement  is  explicated.  It  is  noted  that  latent 
trait  models  allow  both  norm-referenced  and  criterion-referenced  interpretations
of  test  performance.  Using  a  3-parameter  logistic  test  model,
an  example  of  sequential  estimation  in  a  20-item  adaptive  test  is  present¬ 
ed.  After  each  item  is  administered,  four  different  ability  estimates  (two 
likelihood-based  and  two  Bayesian  estimates)  are  calculated.  Characteris¬ 
tics  of  the  four  estimation  methods  are  discussed.  The  information  avail¬ 
able  in  the  items  selected  by  the  adaptive  test  is  compared  with  the  infor¬ 
mation  available  from  application  of  latent  trait  theory,  and  adaptive 
testing  is  advocated  as  a  useful  approach  to  human  assessment. 

3.  C.  David  Vale:  Adaptive  Testing  and  the  Problem  of  Classification 

The  use  of  adaptive  testing  procedures  to  make  ability  classification  decisions
(i.e.,  cutting  score  decisions)  is  discussed.  Data  from  computer
simulations  comparing  conventional  testing  strategies  with  an  adaptive 
testing  strategy  are  presented.  These  data  suggest  that,  although  a  con¬ 
ventional  test  is  as  good  as  an  adaptive  test  when  there  is  one  cutting 
score  at  the  middle  of  the  distribution  of  ability,  an  adaptive  test  can 
provide  better  classification  decisions  when  there  is  more  than  one  cutting 
score.  Some  utility  considerations  are  also  discussed. 

4.  Steven  M.  Pine:  Applications  of  Item  Characteristic  Curve  Theory  to  the 
Problem  of  Test  Bias 

It  is  argued  that  a  major  problem  in  current  efforts  to  develop  less  biased 
tests  is  an  over-reliance  on  classical  test  theory.  Item  characteristic 
curve  (ICC)  theory,  which  is  based  on  individual  rather  than  group-oriented 
measurement,  is  offered  as  a  more  appropriate  measurement  model.  A  defini¬ 
tion  of  test  bias  based  on  ICC  theory  is  presented.  Using  this  definition, 
several  empirical  tests  for  bias  are  presented  and  demonstrated  with  real 
test  data.  Additional  applications  of  ICC  theory  to  the  problem  of  test 
bias  are  also  discussed. 


5.  Isaac  I.  Bejar:  Applications  of  Adaptive  Testing  in  Measuring  Achievement 
and  Performance 

The  paper  reviews  two  relatively  recent  developments  in  psychometric
theory:  the  assessment  of  partial  knowledge  and  research  in  adaptive  testing.
It  is  argued  that  the  use  of  non-dichotomous  item  formats,  needed  for
the  assessment  of  partial  knowledge,  and  now  made  possible  by  the  adminis¬ 
tration  of  achievement  test  items  on  interactive  computers,  should  result 
in  achievement  test  scores  which  are  a  more  realistic  and  precise  indica¬ 
tion  of  what  a  student  can  do. 

(AD  A038114) 
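
The item characteristic curve, test information, and sequential ability estimation ideas summarized in the first two papers above can be illustrated with a minimal sketch. It assumes the standard 3-parameter logistic model with the conventional scaling constant D = 1.7, a coarse ability grid from -4 to +4, and a standard normal prior for the Bayesian estimate. The function names, the grid, and the choice of one likelihood-based and one Bayesian estimator are illustrative assumptions, not the four estimators examined in the symposium paper.

```python
import math

def p_3pl(theta, a, b, c, D=1.7):
    """Probability of a correct response under the 3-parameter logistic model."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def item_information(theta, a, b, c, D=1.7):
    """Fisher information contributed by one 3PL item at ability theta."""
    p = p_3pl(theta, a, b, c, D)
    return (D * a) ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2

def log_likelihood(theta, items, responses):
    """Log-likelihood of 0/1 responses; items are (a, b, c) tuples."""
    total = 0.0
    for (a, b, c), u in zip(items, responses):
        p = p_3pl(theta, a, b, c)
        total += math.log(p) if u == 1 else math.log(1.0 - p)
    return total

# Hypothetical ability grid: -4.0, -3.9, ..., +4.0
GRID = [x / 10.0 for x in range(-40, 41)]

def ml_estimate(items, responses):
    """Grid-search maximum likelihood estimate (bounded by the grid)."""
    return max(GRID, key=lambda t: log_likelihood(t, items, responses))

def eap_estimate(items, responses):
    """Posterior mean of theta under an assumed standard normal prior."""
    weights = [math.exp(log_likelihood(t, items, responses) - 0.5 * t * t)
               for t in GRID]
    return sum(t * w for t, w in zip(GRID, weights)) / sum(weights)

def sequential_estimates(items, responses):
    """(ML, Bayesian) estimates recomputed after each administered item."""
    return [(ml_estimate(items[:k], responses[:k]),
             eap_estimate(items[:k], responses[:k]))
            for k in range(1, len(items) + 1)]
```

With an all-correct or all-incorrect response pattern the likelihood has no interior maximum, so the grid-based ML estimate sits at the edge of the grid, while the Bayesian estimate stays finite because of the prior.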


Research  Report  77-2 

A  Comparison  of  Information  Functions  of  Multiple-Choice 
and  Free-Response  Vocabulary  Items 
C.  David  Vale  and  David  J.  Weiss 
April  1977 

Twenty  multiple-choice  vocabulary  items  and  20  free-response  vocabulary  items 
were  administered  to  660  college  students.  The  free-response  items  consisted  of 
the  stem  words  of  the  multiple-choice  items.  Testees  were  asked  to  respond  to 
the  free-response  items  with  synonyms.  A  computer  algorithm  was  developed  to 
transform  the  numerous  free  responses  entered  by  the  testees  into  a  manageable
number  of  categories.  The  multiple-choice  and  the  free-response  items  were  then 
calibrated  according  to  Bock's  polychotomous  logistic  model.  One  item  was  dis¬ 
carded  because  of  extremely  poor  fit  with  the  model,  and  test  information  func¬ 
tions  were  determined  from  the  other  19  items.  Higher  levels  of  information 
were  obtained  from  the  free-response  items  over  most  of  the  range  of  abilities 
between  θ  =  -3.0  and  θ  =  +3.0.


Research  Report  77-3 

Accuracy  of  Perceived  Test-Item  Difficulties 
J.  Stephen  Prestwood  and  David  J.  Weiss 
May  1977 

This  study  investigated  the  accuracy  with  which  testees  perceive  the  difficulty
of  ability-test  items.  Two  41-item  conventional  tests  of  verbal  ability  were 
constructed  for  administration  to  testees  in  two  ability  groups.  Testees  in 
both  the  high-  and  low-ability  groups  responded  to  each  multiple-choice  item  by 
choosing  the  correct  alternative  and  then  rating  the  item's  difficulty  relative 
to  their  levels  of  ability.  Least-squares  estimates  of  item  difficulty,  which 
were  based  on  the  difficulty  ratings,  correlated  highly  with  proportion-correct 
and  latent  trait  estimates  of  item  difficulty  based  on  a  norming  sample.  Least- 
squares  estimates  of  testee  ability,  which  were  based  solely  on  the  difficulty 
perceptions  of  the  testees,  correlated  significantly  with  number-correct  and 
maximum-likelihood  ability  scores  based  on  the  testees'  conventional  responses 
to  the  items.  These  results  show  that  item-difficulty  perceptions  were  highly 
related  to  the  "objective"  indices  of  item  difficulty  often  used  in  test  construction,
and  that  as  testee  ability  level  increased,  the  items  were  perceived
as  being  relatively  less  difficult.  The  relationship  between  a  testee's  ability
and  his/her  perception  of  an  individual  item's  relative  difficulty  appeared  to
be  weak.  Of  major  importance  was  the  finding  that  items  which  were  appropriate
in  difficulty  levels  from  a  psychometric  standpoint  were  perceived  by  the  testees
as  being  too  difficult  for  their  ability  levels.  The  effects  on  testees  of
tailoring  a  test  such  that  items  are  perceived  as  being  uniformly  too  difficult 
should  be  investigated.  (AD  A041084) 


Research  Report  77-4 

A  Rapid  Item-Search  Procedure  for  Bayesian  Adaptive  Testing 
C.  David  Vale  and  David  J.  Weiss 
May  1977 

An  alternative  item-selection  procedure  for  use  with  Owen's  Bayesian  adaptive 
testing  strategy  is  proposed.  This  procedure  is,  by  design,  faster  than  Owen's 
original  procedure  because  it  searches  only  part  (as  compared  with  all)  of  the 
total  item  pool.  Item  selections  are,  however,  identical  for  both  methods. 

After  a  conceptual  development  of  the  rapid-search  procedure,  the  supporting 
mathematics  are  presented.  In  a  simulated  comparison  with  three  item  pools,  the 
rapid-search  procedure  required  as  little  as  one-tenth  of  the  computer  time  required  by
Owen's  technique.  (AD  A041090)


Research  Report  78-2 

The  Effects  of  Knowledge  of  Results  and  Test  Difficulty 
on  Ability  Test  Performance  and  Psychological  Reactions  to  Testing 
J.  Stephen  Prestwood  and  David  J.  Weiss 
September  1978 

Students  were  administered  one  of  three  conventional  or  one  of  three  stradaptive 
vocabulary  tests  with  or  without  knowledge  of  results  (KR).  The  three  tests  of 
each  type  differed  in  difficulty,  as  assessed  by  the  expected  proportion  of  correct
responses  to  the  test  items.  Results  indicated  that  the  mean  maximum-likelihood
estimates  of  individuals'  abilities  varied  as  a  joint  function  of  KR  provision
and  test  difficulty.  Students  receiving  KR  scored  highest  on  the  most-difficult
test  and  lowest  on  the  least-difficult  test;  students  receiving  no  KR
scored  highest  on  the  least-difficult  test  and  did  most  poorly  on  the  most-difficult
test.  Although  the  students  perceived  the  differences  in  test  difficulty,
there  were  no  effects  on  mean  student  anxiety  or  motivation  scores  attributable
to  difficulty  alone.  Regardless  of  test  difficulty,  students  reacted
very  favorably  to  receiving  KR,  and  its  provision  increased  the  mean  level  of 
reported  motivation. 


Research  Report  79-7 

The  Person  Response  Curve:  Fit  of  Individuals
to  Item  Characteristic  Curve  Models 
Tom  E.  Trabin  and  David  J.  Weiss 
December  1979 


This  study  investigated  a  method  of  determining  the  fit  of  individuals  to  item 
characteristic  curve  (ICC)  models  using  the  person  response  curve  (PRC).  The
construction  of  observed  PRCs  is  based  on  an  individual's  proportion  correct  on
test  item  subsets  (strata)  that  differ  systematically  in  difficulty  level.  A 
method  is  proposed  for  identifying  irregularities  in  an  observed  PRC  by  compar¬ 
ing  it  with  the  expected  PRC  predicted  by  the  three-parameter  logistic  ICC  model 
for  that  individual's  ability  level.  Diagnostic  potential  of  the  PRC  is  dis¬ 
cussed  in  terms  of  the  degree  and  type  of  deviations  of  the  observed  PRC  from 
the  expected  PRC  predicted  by  the  model. 
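
The comparison of observed and expected person response curves described above can be sketched as follows. The sketch assumes items are already grouped into difficulty strata, that each item has three-parameter logistic parameters (a, b, c), and that D = 1.7. The function names and the use of simple stratum-by-stratum differences are illustrative assumptions; the abstract does not specify the fit statistic used in the report.

```python
import math

def p_3pl(theta, a, b, c, D=1.7):
    """Probability of a correct response under the 3-parameter logistic model."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def observed_prc(strata_responses):
    """Observed PRC: proportion correct within each difficulty stratum.

    strata_responses is a list of lists of 0/1 responses, ordered from the
    easiest stratum to the most difficult one.
    """
    return [sum(r) / len(r) for r in strata_responses]

def expected_prc(theta, strata_items):
    """Expected PRC: mean model probability of success within each stratum.

    strata_items is a list of lists of (a, b, c) tuples, in the same order.
    """
    return [sum(p_3pl(theta, a, b, c) for a, b, c in items) / len(items)
            for items in strata_items]

def prc_deviations(observed, expected):
    """Signed observed-minus-expected difference for each stratum."""
    return [o - e for o, e in zip(observed, expected)]
```

Large positive deviations in difficult strata or large negative deviations in easy strata are the kinds of irregularity such a comparison would surface for a single examinee.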

Observed  PRCs  were  constructed  for  151  college  students  using  vocabulary  test 
data  on  216  items  of  wide  difficulty  range.  Data  on  students'  test-taking  motivation,
test-taking  anxiety,  and  perceived  test  difficulty  were  also  obtained.
PRCs  for  the  students  were  found  to  be  reliable  and  to  have  shapes  that  were 
primarily  a  function  of  ability  level.  Three-parameter  logistic  model  expected 
PRCs  served  as  good  predictors  of  observed  PRCs  for  over  90%  of  the  group.  As 
anticipated  from  this  general  overall  fit  of  the  observed  data  to  the  ICC  model, 
there  were  no  significant  correlations  between  degree  of  non-fit  and  test-taking 
motivation,  test-taking  anxiety,  or  perceived  test  difficulty.  Using  split-pool 
observed  PRCs,  a  few  students  were  identified  who  deviated  significantly  from 
the  expected  PRC. 

The  results  of  this  study  suggested  that  three-parameter  logistic  expected  PRCs 
for  given  ability  levels  were  good  predictors  of  test  response  profiles  for  the 
students  in  this  sample.  Significant  non-fit  between  observed  and  expected  PRCs 
would  suggest  the  interaction  of  additional  dimensions  in  the  testing  situation
for  a  given  individual.  Recommendations  are  made  for  further  research  on  person 
response  curves. 


Research  Report  80-2 

Interactive  Computer  Administration  of  a  Spatial  Reasoning  Test 
Austin  T.  Church  and  David  J.  Weiss 
April  1980 

This  report  describes  a  pilot  study  on  the  development  and  administration  of  a 
test  using  a  spatial  reasoning  problem,  the  15-puzzle.  The  test  utilized  the 
on-line  capabilities  of  a  real-time  computer  (1)  to  record  an  examinee's  progress
on  each  problem  through  a  sequence  of  problem-solving  "moves"  and  (2)  to
collect  additional  on-line  data  that  might  be  of  relevance  to  the  evaluation  of
examinee  performance  (e.g.,  number  of  illegal  and  repeated  moves,  response  latency
trends).  The  examinees,  61  students  in  an  introductory  psychology  class,
were  required  to  type  a  sequence  of  moves  that  would  bring  one  4x4  array  of 
scrambled  numbers  (start  configuration)  into  agreement  with  a  second  4x4  array 
(goal  configuration),  using  as  few  moves  as  possible.  Data  analyses  emphasized 
the  comparison  of  several  methods  of  indexing  problem  difficulty,  methods  of 
scoring  individual  performance,  and  the  relationship  between  response  latency 
data,  performance,  and  problem-solving  strategy. 

Subjective  ratings  of  the  perceived  difficulty  of  replications  of  the  15-puzzle 
were  obtained  from  a  separate  student  sample  to  investigate  (1)  the  subjective 
dimensions  used  by  students  in  evaluating  the  difficulty  of  this  problem  type, 
(2)  how  accurately  the  actual  performance  difficulty  of  these  problems  could  be 
evaluated  by  students,  and  (3)  whether  there  were  reliable  individual  differences
in  difficulty  perceptions  related  to  actual  performance  differences.

Results  of  the  study  suggested  that  four  performance  indices  might  be  useful  in
indexing  problem  difficulty:  (1)  mean  number  of  moves  in  the  sample,  (2)  pro¬ 
portion  of  students  solving  the  problem,  (3)  proportion  of  students  solving  the 
problem  in  the  optimal  number  of  moves,  and  (4)  a  Special  Difficulty  Index,  de¬ 
fined  as  the  sample  mean  number  of  moves  divided  by  the  minimum  number  of  moves 
required.  Four  alternative  methods  of  scoring  total  test  performance  and  two 
methods  of  scoring  individual  problem  performance  were  studied.  The  scores  that 
took  into  account  differential  numbers  of  moves  between  the  optimal  and  maximum 
number  allowed  were  related  somewhat  more  to  performance  ratings  obtained  from 
independent  judges. 
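
The four problem-difficulty indices listed above can be computed as in the following sketch. The record layout, a list of (moves, solved) pairs with one pair per student, is a hypothetical representation, and the sketch assumes the mean number of moves is taken over all students in the sample, which the abstract does not state explicitly.

```python
def difficulty_indices(records, minimum_moves):
    """Compute four difficulty indices for one 15-puzzle problem.

    records: list of (moves, solved) pairs, one per student (hypothetical layout).
    minimum_moves: smallest number of moves that solves the problem.
    """
    if not records:
        raise ValueError("at least one student record is required")
    if minimum_moves <= 0:
        raise ValueError("minimum_moves must be positive")

    n = len(records)
    mean_moves = sum(moves for moves, _ in records) / n
    prop_solved = sum(1 for _, solved in records if solved) / n
    prop_optimal = sum(1 for moves, solved in records
                       if solved and moves == minimum_moves) / n
    # Special Difficulty Index: sample mean moves over the minimum required.
    special_index = mean_moves / minimum_moves

    return {
        "mean_moves": mean_moves,
        "proportion_solved": prop_solved,
        "proportion_optimal": prop_optimal,
        "special_difficulty_index": special_index,
    }
```

A Special Difficulty Index near 1.0 means students typically found the shortest solution; larger values indicate more wasted moves relative to the optimum.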

Examination  of  problem  performance  indices,  the  Special  Difficulty  Index,  and
students'  perceptions  of  the  difficulty  of  the  test  problems  indicated  that  most 
of  the  problems  were  too  easy  for  most  students.  However,  the  possibility  of 
obtaining  a  more  discriminating  subset  of  problems  was  suggested  by  item-total 
score  correlations  obtained  for  each  problem.  The  data  suggested  that  better 
consistency  might  be  obtained  using  problems  of  similar  difficulty  levels,  and 
it  was  hypothesized  that  an  adaptive  test  tailoring  problems  to  the  ability  lev¬ 
el  of  each  student  would  increase  the  reliability  of  measurement. 

Mean  initial  and  total  "move"  latencies  for  each  problem  were  strongly  related 
to  some  of  the  performance  indices  of  problem  difficulty.  At  the  level  of  indi¬ 
vidual  performance,  only  total  latency  or  problem  solution  time  was  related  to 
problem  performance.  Latency  data  appeared  to  confound  differences  in  the  abil¬ 
ity  to  visualize  a  sequence  of  moves  and  differences  in  students'  work  styles. 
Strong  evidence  for  these  work  styles  was  found  in  student  consistency  of  ini¬ 
tial,  average,  and  total  response  latency  measures  across  all  problems. 

Perceived  difficulty  ratings  showed  reliable  individual  differences  in  the  level 
and  variability  of  difficulty  perceptions.  The  data  suggested  that  the  individual
differences  found  were  related  to  individual  differences  in  ability  to  visualize
and  to  maintain  a  sequence  of  moves  in  short-term  memory.  It  was  concluded
that  an  adequate  selection  of  problem  replications  should  be  able  to  tap
these  differences,  resulting  in  reliable  solution  performance  differences. 

Improvements  in  problem  selection  and  design  were  suggested  by  the  data  in  this 
study.  Future  tests  of  this  type  should  consist  of  fewer  but  more  difficult 
problems,  particularly  problems  not  permitting  reactive,  impulsive  solutions. 
This  type  of  test  would  seem  especially  appropriate  for  adaptive  administra¬ 
tion:  (1)  scores  on  problems  tailored  to  the  individual's  ability  would  likely 
be  more  highly  related  to  each  other,  resulting  in  more  highly  reliable  total 
scores;  (2)  the  motivational  aspects  of  the  tests,  which  seem  more  taxing  and 
potentially  frustrating  than  conventional  item  formats,  would  likely  be  improved;
and  (3)  for  most  testees  equally  precise  measurements  could  be  obtained
in  shorter  periods  of  time  than  with  conventional  test  administration. 


Research  Report  80-3 

Criterion-Related  Validity  of  Adaptive  Testing  Strategies 
Janet  G.  Thompson  and  David  J.  Weiss 
June  1980 

Criterion-related  validity  of  two  adaptive  tests  was  compared  with  that  of  a  conventional
test  in  two  groups  of  college  students.  Students  in  Group  1  (N  =  101)  were
administered  a  stradaptive  test  and  a  peaked  conventional  test;  students  in
Group  2  (N  =  131)  were  administered  a  Bayesian  adaptive  test  and  the  same  peaked
conventional  test.  All  tests  were  computer-administered  multiple-choice  vocabu¬ 
lary  tests;  items  were  selected  from  the  same  pool,  but  there  was  no  overlap  of 
items  between  the  adaptive  and  conventional  tests  within  each  group.  The  strad¬ 
aptive  test  item  responses  were  scored  using  four  different  methods  (two  mean 
difficulty  scores,  a  Bayesian  score,  and  maximum  likelihood)  with  two  different 
sets  of  item  parameter  estimates,  to  study  the  effects  on  criterion-related  va¬ 
lidity  of  scoring  methods  and/or  item  parameter  estimates.  Criterion  variables 
were  high  school  and  college  grade-point  averages  (GPA),  and  scores  on  the  Amer¬ 
ican  College  Testing  Program  (ACT)  achievement  tests. 

Results  indicated  generally  higher  validities  for  the  adaptive  tests;  at  least 
one  method  of  scoring  the  stradaptive  tests  resulted  in  higher  correlations  than 
the  conventional  test  with  seven  of  the  eight  criterion  variables  (and  equal 
correlations  for  the  eighth),  even  though  the  stradaptive  test  administered  over 
25%  fewer  items,  on  the  average,  than  did  the  conventional  test.  The  stradaptive
test  obtained  a  significantly  higher  correlation  with  overall  college  GPA
(r  =  .27)  than  did  the  conventional  test;  when  math  GPA  was  partialled  from
overall  GPA,  the  maximum  correlation  for  the  stradaptive  test  with  an  average
length  of  29.2  items  was  r  =  .51,  while  the  40-item  conventional  test  correlated
only  .36.  The  data  showed  generally  higher  criterion-related  validities  for  the
mean  difficulty  scores  on  the  stradaptive  test  in  comparison  to  the  Bayesian  and 
maximum  likelihood  scores;  the  different  item  parameter  estimates  had  no  effect 
on  validity,  resulting  in  scores  that  correlated  .98  with  each  other. 

Although  the  mean  length  of  the  Bayesian  adaptive  test  was  48.7  items,  the  medi¬ 
an  number  of  items  (35)  was  less  than  that  of  the  40-item  conventional  test. 
Ability  estimates  from  this  adaptive  test  also  correlated  higher  with  seven  of 
the  eight  criterion  variables  than  did  scores  on  the  conventional  tests,  al¬ 
though  none  of  the  differences  were  statistically  significant. 

These  data  indicate  that  adaptive  tests  can  achieve  criterion-related  validities 
equal  to,  and  in  some  cases  significantly  greater  than,  those  obtained  by  con¬ 
ventional  tests  while  administering  up  to  27%  fewer  items,  on  the  average.  The 
data  also  suggest  that  latent-trait-based  scoring  of  stradaptive  tests  may  not 
be  optimal  with  respect  to  criterion-related  validity.  Limitations  of  the  study 
are  discussed  and  suggestions  are  made  for  additional  research.  (AD  A087595) 


Research  Report  80-5 

An  Alternate-Forms  Reliability  and  Concurrent  Validity 
Comparison  of  Bayesian  Adaptive  and  Conventional  Ability  Tests 
G.  Gage  Kingsbury  and  David  J.  Weiss 
December  1980 

Two  30-item  alternate  forms  of  a  conventional  test  and  a  Bayesian  adaptive  test 
were  administered  by  computer  to  472  undergraduate  psychology  students.  In  ad¬ 
dition,  each  student  completed  a  120-item  paper-and-pencil  test,  which  served  as 
a  concurrent  validity  criterion  test,  and  a  series  of  very  easy  questions  de¬ 
signed  to  detect  students  who  were  not  answering  conscientiously.  All  test 
items  were  five-alternative  multiple-choice  vocabulary  items.  Reliability  and 
concurrent  validity  of  the  two  testing  strategies  were  evaluated  after  the  ad¬ 
ministration  of  each  item  for  each  of  the  tests,  so  that  trends  indicating  dif¬ 
ferences  in  the  testing  strategies  as  a  function  of  test  length  could  be  detect¬ 
ed.  For  each  test,  additional  analyses  were  conducted  to  determine  whether  the 
two  forms  of  the  test  were  operationally  alternate  forms. 
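
Evaluating alternate-forms reliability after each item, as described above, amounts to correlating the two forms' scores at every test length. The sketch below assumes each form's data are stored as one row per student, with entry k holding that student's score or ability estimate after k + 1 items. This layout and the function names are illustrative assumptions rather than the report's actual procedure.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    if sxx == 0.0 or syy == 0.0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)

def reliability_by_length(form_a, form_b):
    """Alternate-forms correlation at each test length.

    form_a[i][k] and form_b[i][k] hold student i's score after k + 1 items
    on the respective form (hypothetical layout).
    """
    n_lengths = len(form_a[0])
    return [pearson_r([row[k] for row in form_a],
                      [row[k] for row in form_b])
            for k in range(n_lengths)]
```

Plotting the returned list against test length for each testing strategy would show the kind of length-dependent trend the study set out to detect; the same loop applied to a criterion score in place of the second form gives concurrent validity by length.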

Results  of  the  analysis  of  alternate-forms  correspondence  indicated  that  for  all 
test  lengths  greater  than  10  items,  each  of  the  alternate  forms  for  the  two  test 
types  resulted  in  fairly  constant  mean  ability  level  estimates.  When  the  scor¬ 
ing  procedure  was  equated,  the  mean  ability  levels  estimated  from  the  two  forms 
of  the  conventional  test  differed  to  a  greater  extent  than  those  estimated  from 
the  two  forms  of  the  Bayesian  adaptive  test. 

The  alternate-forms  reliability  analysis  indicated  that  the  two  forms  of  the 
Bayesian  test  resulted  in  more  reliable  scores  than  the  two  forms  of  the  conven¬ 
tional  test  for  all  test  lengths  greater  than  two  items.  This  result  was  ob¬ 
served  when  the  conventional  test  was  scored  either  by  the  Bayesian  or  propor¬ 
tion-correct  method. 

The  concurrent  validity  analysis  showed  that  the  conventional  test  produced 
ability  level  estimates  that  correlated  more  highly  with  the  criterion  test 
scores  than  did  the  Bayesian  test  for  all  lengths  greater  than  four  items.  This 
result  was  observed  for  both  scoring  procedures  used  with  the  conventional  test. 

Limitations  of  the  study,  and  the  conclusions  that  may  be  drawn  from  it,  are 
discussed.  These  limitations,  which  may  have  affected  the  results  of  this 
study,  included  possible  differences  in  the  alternate  forms  used  within  the  two 
testing  strategies,  the  relatively  small  calibration  samples  used  to  estimate 
the  ICC  parameters  for  the  items  used  in  the  study,  and  method  variance  in  the 
conventional  tests.  (AD  A094477)


Research  Report  81-2 

Effects  of  Immediate  Feedback  and  Pacing  of  Item  Presentation 
on  Ability  Test  Performance  and  Psychological  Reactions  to  Testing 
Marilyn  F.  Johnson,  David  J.  Weiss,  and  J.  Stephen  Prestwood
February  1981


The  study  investigated  the  joint  effects  of  knowledge  of  results  (KR  or  no-KR),
pacing  of  item  presentation  (computer  or  self-pacing),  and  type  of  testing
strategy  (50-item  peaked  conventional,  variable-length  stradaptive,  or  50-item
fixed-length  stradaptive  test)  on  ability  test  performance,  test  item  response 
latency,  information,  and  psychological  reactions  to  testing.  The  psychological 
reactions  to  testing  were  obtained  from  Likert-type  items  that  assessed  test-taking
anxiety,  motivation,  perception  of  difficulty,  and  reactions  to  knowledge
of  results.  Data  were  obtained  from  447  college  students  randomly  assigned  to 
one  of  the  12  experimental  conditions. 

The  results  indicated  that  there  were  no  effects  on  ability  estimates  due  to 
knowledge  of  results,  testing  strategy,  or  pacing  of  item  presentation.  Al¬ 
though  average  latencies  were  greater  on  the  stradaptive  tests  than  on  the  con¬ 
ventional  test,  the  overall  testing  time  was  not  substantially  longer  on  the 
adaptive  tests  and  may  have  been  a  function  of  differences  in  test  difficulty. 
Analysis  of  information  values  indicated  higher  levels  of  information  on  the 
stradaptive  tests  than  on  the  conventional  test.  There  was  no  statistically 
significant  main  effect  for  any  of  the  three  experimental  conditions  when  test 
anxiety  or  test-taking  motivation  were  the  dependent  variables,  although  there 
were  some  significant  interaction  effects. 

These  results  indicate  that  testing  conditions  may  interact  in  a  complex  way  to 
determine  psychological  reactions  to  the  testing  environment.  The  interactions 
do  suggest,  however,  a  somewhat  consistent  standardizing  effect  of  KR  on  test 
anxiety  and  test-taking  motivation.  This  standardizing  effect  of  KR  showed  that 
approximately  equal  levels  of  motivation  and  anxiety  were  reported  under  the 
various  testing  conditions  when  KR  was  provided,  but  that  mean  levels  of  these 
variables  were  substantially  different  when  KR  was  not  provided.  Consistent 
with  theoretical  expectations,  the  conventional  test  was  perceived  as  being 
either  too  easy  or  too  difficult,  whereas  the  adaptive  tests  were  perceived  more 
often  as  being  of  appropriate  difficulty. 

The  results  concerning  the  effects  of  KR  on  test  performance,  motivation,  and 
anxiety  found  in  this  study  were  contrary  to  earlier  reported  findings;  the 
differences  between  the  studies  are  delineated.  Recommendations  are  made 
concerning  the  control  of  specific  testing  conditions,  such  as  the  difficulty  of 
the  test  and  the  ability  level  of  the  examinee  population,  and  suggestions  are 
offered  for  further  analysis  of  the  standardizing  effect  of  KR. 


Research  Report  83-1 

Reliability  and  Validity  of  Adaptive  and  Conventional  Tests 
In  a  Military  Recruit  Population 
John  T.  Martin,  James  R.  McBride,  and  David  J.  Weiss 

January  1983 

A  conventional  verbal  ability  test  and  a  Bayesian  adaptive  verbal  ability  test 
were  compared  using  a  variety  of  psychometric  criteria.  Tests  were  administered 
to  550  Marine  recruits,  half  of  whom  received  two  30-item  alternate  forms  of  a 
conventional  test  and  half  of  whom  received  two  30-item  alternate  forms  of  a 
Bayesian  adaptive  test.  Both  types  of  tests  were  computer  administered  and  were 
followed  by  a  50-item  conventional  verbal  ability  criterion  test. 

The  alternate  forms  of  the  adaptive  test  resulted  in  scores  that  were  much  more 
similar  in  means  and  variances  than  were  those  of  the  conventional  tests,  for  which  most 
means  and  variances  for  various  test  lengths  were  significantly  different. 
Adaptive  testing  resulted  in  significantly  higher  alternate  forms  reliability 
correlations  for  all  test  lengths  through  19  items;  reliability  of  a  9-item 
adaptive  test  was  equal  to  that  of  a  17-item  conventional  test.  Validity  corre¬ 
lations  were  higher  for  the  adaptive  procedure  for  all  test  lengths.  Validity 
of  an  11-item  adaptive  test  was  equal  to  that  of  a  27-item  conventional  test,  even 
though  the  adaptive  tests  used,  on  the  average,  less  discriminating  items  than 
the  conventional  test.  Very  few  of  the  recruits  had  difficulty  in  responding  to 
the  computer-administered  instructions  on  use  of  the 
testing  terminals.  Analysis  showed  some  differences  in  test  duration  between 
the  two  testing  strategies;  where  they  occurred,  they  were  explained  by  the 
ability  level  of  the  examinees,  i.e.,  higher  ability  examinees  who  were 
administered  adaptive  tests  received  more  difficult  items  and  therefore  had 
significantly  longer  testing  times.  Given  the  shorter  adaptive  test  length 
needed  to  obtain  reliabilities  and  validities  similar  to  those  of  the 
conventional  test,  however,  the  slight  increases  observed  in  adaptive  testing 
time  were  negligible. 

The  data  support  the  feasibility  of  adaptive  testing  with  military  recruit  popu¬ 
lations  and  support  theoretical  predictions  of  the  psychometric  superiority  of 
adaptive  tests  in  comparison  with  number-correct  scored  conventional  tests. 
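
The  reliability  and  validity  comparisons  summarized  above  rest  on  correlations 
between  score  sets.  As  a  minimal  sketch,  and  assuming  (the  abstract  does  not 
say)  that  ordinary  Pearson  product-moment  correlations  were  used,  alternate 
forms  reliability  can  be  computed  as  the  correlation  between  scores  on  the  two 
forms,  and  validity  as  the  correlation  between  test  scores  and  criterion  test 
scores.  The  function  names  below  are  hypothetical  and  are  not  taken  from  the 
report. 

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and of squared deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when scores do not vary")
    return sxy / sqrt(sxx * syy)


def alternate_forms_reliability(form_a_scores, form_b_scores):
    """Hypothetical helper: correlation between scores on two alternate forms."""
    return pearson_r(form_a_scores, form_b_scores)


def validity(test_scores, criterion_scores):
    """Hypothetical helper: correlation between test and criterion test scores."""
    return pearson_r(test_scores, criterion_scores)
```

Comparing  strategies  at  a  given  test  length  would  then  amount  to  scoring  each 
examinee  on  the  first  k  items  of  each  form  and  calling  these  functions  once  per 
value  of  k. 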